[server] Add SLO classification dimensions to record-level delay metric#2736
Conversation
There was a problem hiding this comment.
Pull request overview
Adds SLO-tier classification dimensions to the server-side OTel record-level delay metric (ingestion.replication.record.delay) so operators can slice latency/SLOs by locality, partial-update status, and chunking without joining against external store config.
Changes:
- Extend
RecordLevelDelayOtelStats/RecordLevelDelayOtelMetricEntityto emit new dimensions:venice.region.locality,venice.partial_update.status,venice.chunking.status. - Plumb local region + store metadata into
HeartbeatVersionedStatsso the delay metric can be tagged at stats creation time. - Add unit + integration/E2E tests validating the new dimensions are present on emitted histograms.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/NearlineE2ELatencyTest.java | Adds an E2E assertion that at least one server emits the record-delay histogram with expected SLO dimensions. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VenicePartialUpdateStatusTest.java | New unit test for the partial-update dimension enum wiring/value mapping. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VenicePartialUpdateStatus.java | Introduces VenicePartialUpdateStatus dimension enum. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java | Registers the new VENICE_PARTIAL_UPDATE_STATUS dimension name. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelStatsTest.java | Updates/extends tests to validate added SLO dimensions on record-delay histograms. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelMetricEntityTest.java | Updates metric-entity dimension expectations to include the new dimensions. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatVersionedStatsTest.java | Adapts tests for new constructor args and validates record-delay dimension tagging. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelStats.java | Adds locality/partial-update/chunking tagging via base dimensions + per-region base dimensions. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelMetricEntity.java | Expands the metric entity’s required dimension set to include SLO dimensions. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatVersionedStats.java | Supplies local region and store-derived flags into RecordLevelDelayOtelStats creation. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatMonitoringService.java | Passes localRegionName into HeartbeatVersionedStats to enable locality computation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
3fa3100 to
a15ff88
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
bf30d8f to
d91cda3
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
96b523b to
603612e
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6b5d57a to
528808c
Compare
…ntry
The non-AA branch of HMS#initializeEntry was keying the HeartbeatKey
by `localRegionName` directly. On a server with the local region
unconfigured (`localRegionName == ""`), the resulting key carries
`region=""`. Two failure modes follow:
- The OTel base-dimension validator rejects empty values
(validateRequiredDimensions check 5), so
`MetricEntityStateFiveEnums.create`
throws when the per-record path tries to emit for that key.
- The per-record SIT path normalizes its source region via
`RegionUtils.normalizeRegionName(...)` (UNKNOWN_REGION on null/empty),
so the per-record key and the HMS-stored key would not match
on `equals` — `computeIfPresent` would silently no-op.
Run `localRegionName` through the same `RegionUtils.normalizeRegionName`
so unconfigured deployments end up with `region="unknown"` consistently
across both code paths.
Also convert a 4-line `//` comment in `RecordLevelDelayOtelStats` to
the `/* */` block form per house style.
VeniceServerWrapper was already enabling SERVER_RECORD_LEVEL_TIMESTAMP_ENABLED by default for integration tests. Pair it with SERVER_PER_RECORD_OTEL_METRICS_ENABLED so the per-record OTel emit path runs end-to-end in the same scope and any test inspecting `ingestion.replication.record.delay` (or related per-record SLO metrics) sees emissions without explicit setup.
The per-record `ingestion.replication.record.delay` metric carries
write_type / chunking_status / region_locality. The companion
`ingestion.replication.heartbeat.delay` (computed by the periodic
reporter from heartbeat control messages) lacked the same labels, so
SLO dashboards only had the per-record signal — heartbeat-derived lag
slices weren't comparable across tiers.
Bring the heartbeat-delay metric to parity:
- HeartbeatOtelMetricEntity: bump INGESTION_HEARTBEAT_DELAY's
dimension list to include venice.region.locality, venice.store.write_type,
venice.chunking.status. Locality joins as a per-region base dim
(deterministic function of source region for a given server),
matching RecordLevelDelayOtelStats's layout.
- HeartbeatOtelStats: switch the per-region cache from
MetricEntityStateThreeEnums to MetricEntityStateFiveEnums; null
locality coerces to REMOTE at emission so the metric still ships
a concrete label.
- HeartbeatVersionedStats.recordLeaderLag/recordFollowerLag: take
write_type / chunking_status / locality and forward to the
OTel-side recorder.
- HeartbeatMonitoringService.record(): pull labels off the
HeartbeatKey already carried in the heartbeat-timestamp map (the
keys are inserted with labels baked in by initializeEntry /
updateLagMonitor).
- Consolidate the legacy ReportLagFunction into RecordLatencyFunction
so both the heartbeat-lag and record-lag periodic paths use the
same key-based callback shape.
Test fixtures across HeartbeatOtelStatsTest and HeartbeatVersionedStatsTest
record with the default-value combo (REGULAR, UNCHUNKED, LOCAL) and assert
the corresponding attributes; targeted tests cover non-default combos in
the existing record-level suites.
NearlineE2ELatencyTest now iterates servers in BOTH child fabrics and
asserts LEADER replicas emit record-delay with locality=LOCAL on the
source-RT fabric AND locality=REMOTE on the cross-region fabric.
Followers always emit LOCAL today (they read from local VT) — left a
TODO to extend the assertion when followers gain cross-region tracking.
|
Thanks a ton, @m-nagarajan for the review! |
Bump the expected dimension set for INGESTION_HEARTBEAT_DELAY to match the new SLO-aware definition (locality, write_type, chunking_status added by the previous commit). Without this update the fixture's set comparison fails because the metric entity now exposes 9 dims, not 6.
The test asserts that LEADER replicas emit `locality=REMOTE` on the non-source fabric. Without `setNativeReplicationSourceFabric(...)`, the multi-region test framework leaves the source fabric ambiguous and dc1 leaders end up consuming dc1's local RT — every leader emits LOCAL and the cross-region assertion fails deterministically (observed: `Observed leader localities: [local]`). Pin the NR source explicitly to `childDatacenters.get(0).getRegionName()` so dc1 leaders pull from dc0's RT under NR. dc1 leader's source region is then dc0, and locality resolves to REMOTE.
…verage
The previous variant used a single dc0 Samza producer with NR-but-not-AA,
relying on dc1's leader to pull cross-region from dc0's RT. In practice
the multi-region test fixture didn't deterministically replicate dc0's
RT into dc1 within a reasonable wait window — dc1 often never served
the streaming records, so dc1 leaders never emitted any record-delay
points and the locality=REMOTE assertion failed deterministically
(observed locally even with 120s waits).
Replace with the AA + per-fabric Samza pattern used by
ActiveActiveReplicationForHybridTest:
- Enable active-active replication on the store. Each fabric's
leader pulls EVERY fabric's RT under AA, so a leader processing
a record from a remote-fabric RT emits locality=REMOTE.
- Use sendEmptyPushAndWait to materialize the hybrid version
instead of a batch push (no batch records needed for this test).
- Get a Samza producer per child fabric via
getSamzaProducerForStream(multiRegionMultiClusterWrapper, dcIndex,
storeName) and write streaming records to each region.
- Wait for cross-region replication (each fabric's router serves
records produced in BOTH fabrics) before the metric assertion.
Verified locally: testRecordLevelDelaySloDimensions now passes in
~35s with both LEADER+LOCAL and LEADER+REMOTE observed. FOLLOWER+LOCAL
also asserted; FOLLOWER+REMOTE remains TODO until follower-side
cross-region tracking lands.
reporter
Two intertwined changes:
1. Make the heartbeat reporter cycle configurable.
Production keeps the 60s default
(`SERVER_HEARTBEAT_REPORTER_INTERVAL_SECONDS`,
plumbed through `VeniceServerConfig#getHeartbeatReporterIntervalSeconds`).
E2E tests override to 1s in `VeniceServerWrapper` so heartbeat-derived
metrics show up well before test method timeouts and don't bottleneck
parameterized test runs.
2. Parameterize `testRecordLevelDelaySloDimensions` over (writeCompute,
chunking) booleans via DataProviderUtils' `Two-True-and-False` —
four runs covering the full SLO write_type × chunking_status matrix.
- Switched the value schema to `NAME_RECORD_V1_SCHEMA` (record type
with field defaults) so `setWriteComputationEnabled(true)` is
accepted by the controller; primitive value schemas are rejected.
- Per-param store names via `Utils.getUniqueString(...)` so the four
runs don't collide.
- Records are GenericRecords with firstName/lastName fields — the
dim values reflect store config at PCS construction, not record
wire format, so we don't need to actually trigger chunking via
record size or to send WC partial-update payloads.
- Assertions now compute expected `writeType` / `chunkingStatus`
from the params and filter the histogram points to that combo,
then assert leader emits BOTH localities (LOCAL + REMOTE under AA
cross-region pull) and follower emits LOCAL.
Verified locally — all four runs pass:
- (false, false) regular + unchunked 35.1s
- (true, false) write_compute + unchunked 32.2s
- (false, true) regular + chunked 29.3s
- (true, true) write_compute + chunked 32.9s
DaVinciClientTestWithHeartbeatReadyToServeCheckTest.testHybridStore hit its 240s method timeout deterministically on both parameterized runs after the global override took effect — likely contention on the heartbeat-timestamps map between the per-second reporter cycle and the test's heartbeat-lag-driven ready-to-serve checks slowed ingestion past the test budget. The config plumbing (SERVER_HEARTBEAT_REPORTER_INTERVAL_SECONDS in ConfigKeys / VeniceServerConfig / HeartbeatMonitoringService) stays in place — a future test that specifically inspects heartbeat-derived metrics can opt in by setting it on its own VeniceServerWrapper. The existing SLO E2E test doesn't need it: per-record OTel emit fires inline on every data record, independent of the reporter cycle.
DaVinciClientTestWithHeartbeatReadyToServeCheckTest.testHybridStore hung deterministically at the 240s method timeout after the HMS-side RegionUtils.normalizeRegionName fix landed (commit eb23c20). Root cause: the per-record entry point (PCS#getOrCreateCachedHeartbeatKey, called from LeaderFollowerStoreIngestionTask#trackRecordReceived) was still passing the raw region string. On a server where serverConfig.getRegionName() is empty, HMS#initializeEntry inserted HeartbeatKey.region="unknown" while the per-record path produced HeartbeatKey.region="" — different identities, so heartbeatTimestamps.computeIfPresent silently no-op'd in recordIngestionTimestamp, the heartbeat-lag-driven ready-to-serve check never tripped, and the DVC client's subscribed partition never became readable. Apply the same normalization at the per-record entry point so both paths converge on the same HeartbeatKey identity. Verified locally: - testHybridStore[0,2] both PASS (32s and 36s) - testRecordLevelDelaySloDimensions still PASS for all 4 (writeCompute, chunking) param combos (~30s each)
Reject values < 1 with a clear VeniceException at VeniceServerConfig construction. Without this, a misconfiguration of 0 or negative would make TimeUnit.SECONDS.sleep(...) throw IllegalArgumentException on every reporter cycle — caught as Throwable, logged, and the thread would tight-loop on the exception path. Failing fast at startup is much easier to diagnose than the symptomatic loop.
|
Thanks a lot, @m-nagarajan! 🙏 |
Problem Statement
The
ingestion.replication.record.delayOTel metric lacks dimensions needed for nearline write SLO tier classification. To measure SLO attainment per workload tier, operators must JOIN metric data with external store config — complex, fragile, and not supported natively in Grafana/PromQL.Solution
Add three SLO classification dimensions to
ingestion.replication.record.delay:venice.region.localitylocal,remoteserverConfig.getRegionName()venice.store.write_typeregular,write_computestore.isWriteComputationEnabled()venice.chunking.statuschunked,unchunkedversion.isChunkingEnabled()venice.region.localitystays a base dimension because it is a deterministic function of the source region — caller provides the resolved enum on the first record per region, and it is cached on that region's metric state.venice.store.write_typeandvenice.chunking.statusare emitted as dynamic dimensions (4th and 5th enum levels in a newMetricEntityStateFiveEnums). Chunking is a per-version flag (Version#isChunkingEnabled()) and the metric state is shared across versions for the store, so dynamic emission lets a single store accurately label per-version differences. Cardinality overhead is bounded: 2 × 2 = 4 extra series per(store, version, region)tuple.Where labels are resolved
Once at SIT init from the
Version(the only layer that owns per-version configuration), theStore, and the server's local region.PartitionConsumptionStatetakes the labels as mandatory constructor params (isWriteComputationEnabled,isChunked,localRegionName) and bakes them into every cachedHeartbeatKey, so the per-record OTel emit path reads pre-resolved enum references — no per-call string allocation, no per-callStore/Versionlookup.HeartbeatMonitoringService.updateLagMonitorresolves the same labels from theStore/Versionreturned bywaitVersionand constructs labeled keys at insertion (addLeaderLagMonitor/addFollowerLagMonitornow take the two enum params). The periodic record-lag path then iterates the map and pulls labels off the stored keys.HeartbeatKeypassenger-field designIdentity in the heartbeat map is
(storeName, version, partition, region). SLO labels (writeType,chunkingStatus,locality) are passenger fields excluded fromequals/hashCode/toString. This matters because:Map.putIfAbsentkeeps the first-stored key reference, so once a labeled key is in the map the labels are write-once-per-version.Within a single version's lifecycle these labels are immutable; cross-version flips are covered by the next push that creates a new
HeartbeatKeyset with current labels. SeeHeartbeatKeyTestfor the locked-down contract.Empty-region handling
localRegionNamecan be unset (empty string or null) on servers that aren't running with region config. Both PCS (getOrCreateCachedHeartbeatKey) and HMS (initializeEntry) skip locality resolution in that case and leave itnull— the alternative (defaulting toREMOTEat resolution time) would mislabel every region as remote. At the OTel emit site (RecordLevelDelayOtelStats#recordRecordDelayOtelMetrics), anulllocality is then defaulted toREMOTEso the metric still emits with a concrete label — from an unconfigured server's perspective every source region IS "not the local region".Other changes worth reviewing
LAG_MONITOR_UPDATEvalue onVeniceHeartbeatComponent. The catch block inHMS.updateLagMonitornow recordsrecordHeartbeatExceptionCount(LAG_MONITOR_UPDATE)so swallowed exceptions surface on the same dashboard as the other heartbeat-component exception counters.HeartbeatMonitoringServiceStatsis enforced non-null at HMS construction (Objects.requireNonNull).internal/venice-test-common/test-shard-assignments.json: triage move ofTestAdminSparkServerWithMultiServersfrom shard 28 → 85 so the (pre-existing)VeniceServerTest.testDropStorePartitionSynchronously120s-timeout flake's retry budget fits inside the 12-min wall introduced by [ci] Rebalance integration test shards into [5min, 6min] band #2784. Not part of the SLO work; bundled here to keep CI green.Code changes
ingestion.replication.record.delaymetric, gated byserver.record.level.timestamp.enabledandserver.per.record.otel.metrics.enabled(both default:false). No new config flags introduced.Concurrency-Specific Checks
final; HMS uses existingsynchronized initializeEntry.VeniceConcurrentHashMapfor per-region metric states — unchanged).Testing Done
New unit tests:
HeartbeatKeyTest— locks down the passenger-field contract (equals/hashCode/toStringignore labels;putIfAbsentkeeps first-stored key reference).VeniceStoreWriteTypeTest,MetricEntityStateFiveEnumsTest(incl.testGetAttributesWithDiagonalDimensionComboto catch swappedcomputeIfAbsentchains in the 5-level nestedEnumMap).RecordLevelDelayOtelStatsTest,RecordLevelDelayOtelMetricEntityTest,HeartbeatVersionedStatsTest.Extended existing tests:
PartitionConsumptionStateTest— three new tests assertinggetOrCreateCachedHeartbeatKeyresolves labels correctly across(WC + chunked + remote),(regular + unchunked + local), and the empty-local-region branch.HeartbeatMonitoringServiceTest.testUpdateLagMonitorBakesWriteComputeAndChunkedLabelsOnStoredKey— covers theWRITE_COMPUTE+CHUNKEDresolution branch inupdateLagMonitor.VeniceHeartbeatComponentTest— covers the newLAG_MONITOR_UPDATEenum value.NearlineE2ELatencyTest.testRecordLevelDelaySloDimensions— full nearline ingestion pipeline with the(REGULAR, UNCHUNKED, LOCAL)default-value combo end-to-end. 240s test timeout to cover worst-case 60s push + 30s read availability + 90s heartbeat reporter cycle. Assertssum > 0andmax > 0on the matched datapoint so acount >= 1butsum == 0regression can't slip through.Non-default value combos (
WRITE_COMPUTE/CHUNKED/REMOTE) are covered at the unit level rather than E2E to avoid the WC + chunked + cross-region setup cost.Does this PR introduce any user-facing or breaking changes?